Speedup snapshot stale indices delete #64513

Closed
wants to merge 7 commits into from

piyushdaftary
Contributor

Fixes #61513

  • Have you signed the contributor license agreement? Yes
  • Have you followed the contributor guidelines? Yes
  • If submitting code, have you built your formula locally prior to submission with gradle check? Yes
  • If submitting code, is your pull request against master? Unless there is a good reason otherwise, we prefer pull requests against master and will backport as needed. Yes
  • If submitting code, have you checked that your submission is for an OS and architecture that we support? Yes
  • If you are submitting this code for a class then read our policy for that. Yes

The current implementation of snapshot deletion, cleanupStaleIndices(), is very slow: it deletes stale indices from the repository sequentially, one after another.

With this change, deleting stale indices is no longer a single-threaded operation: multiple stale indices are deleted in parallel using the SNAPSHOT thread pool's workers.

I have added more tests to make sure the outage and failover scenarios are handled.
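
For readers who have not seen the diff, the idea described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from this PR: the class `ParallelStaleIndexCleanup`, the method `deleteStaleIndices`, and its parameters are hypothetical names, a plain `ExecutorService` stands in for Elasticsearch's SNAPSHOT thread pool, and a `Consumer<String>` stands in for the blob-store call that deletes one stale index directory.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

// Hypothetical sketch: not the Elasticsearch implementation.
public class ParallelStaleIndexCleanup {

    /**
     * Deletes every id in staleIndices using at most maxWorkers concurrent tasks.
     * Each worker keeps polling a shared queue until it is empty, so a slow
     * deletion only occupies one worker instead of blocking the whole cleanup.
     *
     * @return the ids whose deletion threw a RuntimeException
     */
    public static List<String> deleteStaleIndices(Collection<String> staleIndices,
                                                  Consumer<String> deleteIndexDir, // assumed stand-in for the repository delete
                                                  ExecutorService pool,            // assumed stand-in for the SNAPSHOT pool
                                                  int maxWorkers) throws InterruptedException {
        if (maxWorkers < 1) {
            throw new IllegalArgumentException("maxWorkers must be >= 1");
        }
        Queue<String> pending = new ConcurrentLinkedQueue<>(staleIndices);
        Queue<String> failed = new ConcurrentLinkedQueue<>();
        // Never start more workers than there are indices to delete.
        int workers = Math.min(maxWorkers, staleIndices.size());
        CountDownLatch done = new CountDownLatch(workers);

        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    String indexId;
                    while ((indexId = pending.poll()) != null) {
                        try {
                            deleteIndexDir.accept(indexId);
                        } catch (RuntimeException e) {
                            // One failed deletion must not stop the remaining ones.
                            failed.add(indexId);
                        }
                    }
                } finally {
                    done.countDown();
                }
            });
        }

        // Block until every worker has drained the queue.
        done.await();
        return new ArrayList<>(failed);
    }
}
```

Having a fixed number of workers drain a shared queue, rather than submitting one task per index, keeps the cleanup from flooding the pool's queue when a repository holds many stale indices. Whether the PR bounds concurrency in exactly this way is not stated in the conversation.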

@tvernum added the `:Distributed Coordination/Snapshot/Restore` label Nov 3, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine added the `Team:Distributed (Obsolete)` label Nov 3, 2020
@original-brownbear original-brownbear self-requested a review November 3, 2020 06:29
Member

@original-brownbear original-brownbear left a comment

Thanks @piyushdaftary , one ask :)

@original-brownbear
Member

original-brownbear commented Dec 1, 2020

Thanks @piyushdaftary and sorry for letting this fall off my radar for a bit. I'll do my best to review this one today.

Member

@original-brownbear original-brownbear left a comment

Very nice work @piyushdaftary ! Sorry again it took me so long to review this.

Just a few open details here, but the general solution is exactly what I had in mind, thanks!

@original-brownbear
Member

Jenkins test this

@original-brownbear
Member

Jenkins run elasticsearch-ci/bwc (dep download failure only)

@original-brownbear
Member

@piyushdaftary thanks this looks really nice now, could you please fix the checkstyle issues that are currently failing CI runs:

22:34:09 [ant:checkstyle] [ERROR] /dev/shm/elastic+elasticsearch+pull-request-1/server/src/internalClusterTest/java/org/elasticsearch/snapshots/RepositoriesIT.java:24:8: Unused import - org.elasticsearch.action.admin.cluster.snapshots.create.CreateSnapshotResponse. [UnusedImports]
22:34:09 [ant:checkstyle] [ERROR] /dev/shm/elastic+elasticsearch+pull-request-1/server/src/internalClusterTest/java/org/elasticsearch/snapshots/RepositoriesIT.java:48:15: Unused import - org.hamcrest.Matchers.greaterThan. [UnusedImports]

(just some unused imports in the test class I think). You can make sure they all got fixed locally by running ./gradlew precommit btw.

Thanks again!

@piyushdaftary
Contributor Author

Thanks @original-brownbear.
Removed the unused imports.

@original-brownbear
Member

Jenkins test this

@original-brownbear
Member

@piyushdaftary thanks, looks good :)

There is one test failure here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/14390/testReport/junit/org.elasticsearch.repositories.fs/FsBlobStoreRepositoryIT/testIndicesDeletedFromRepository/ that unfortunately looks related to the changes. Could you look into this one please? The failure includes instructions on how to reproduce it locally; check the "REPRODUCE WITH" message in the CI logs. Let me know if you need any help with this, and I'll see if I can track it down.

@piyushdaftary
Contributor Author

piyushdaftary commented Dec 14, 2020

Hi @original-brownbear
I tried to reproduce the test failure org.elasticsearch.repositories.fs.FsBlobStoreRepositoryIT.testIndicesDeletedFromRepository locally, but unfortunately I am unable to reproduce it on my machine.

Below is the command I am using to run the test; it succeeds every time for me:

38f9d37053af:elasticsearch pdaftary$ ./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.repositories.fs.FsBlobStoreRepositoryIT.testIndicesDeletedFromRepository" -Dtests.seed=A6BCE803F5F88E4C -Dtests.security.manager=true -Dtests.locale=es-SV -Dtests.timezone=Australia/Canberra -Druntime.java=11
=======================================
Elasticsearch Build Hamster says Hello!
  Gradle Version        : 6.6.1
  OS Info               : Mac OS X 10.14.6 (x86_64)
  Runtime JDK Version   : 11 (Oracle JDK)
  Runtime java.home     : /Library/Java/JavaVirtualMachines/jdk-11.0.5.jdk/Contents/Home
  Gradle JDK Version    : 14 (OpenJDK)
  Gradle java.home      : /Library/Java/JavaVirtualMachines/jdk-14.0.1.jdk/Contents/Home
  Random Testing Seed   : A6BCE803F5F88E4C
  In FIPS 140 mode      : false
=======================================

BUILD SUCCESSFUL in 1m 5s

@piyushdaftary
Contributor Author

piyushdaftary commented Dec 16, 2020

@original-brownbear: Can you please let me know how else I can reproduce the issue on my local machine? So far the test org.elasticsearch.repositories.fs.FsBlobStoreRepositoryIT.testIndicesDeletedFromRepository has been passing on my machine when run against my branch.

Do I need to merge in the latest elastic/elasticsearch master branch to reproduce it?

@original-brownbear
Member

@piyushdaftary

sorry for the delay here, had a few high pressure things incoming at the start of this week, will do my best to look into this today.

> Do I need to merge in the latest elastic/elasticsearch master branch to reproduce it?

No, CI runs your exact code without any merging beforehand, so it should be reproducible in theory. In practice it's often a little tricky :) I'll try to think through the change today; it could be that we've just changed some timing in such a way that it surfaced an existing issue. I'm on it later.

@original-brownbear
Member

Jenkins test this

@original-brownbear
Member

@piyushdaftary this looks good, tests seem to be passing now. Could you merge in latest master though? I think the fact that we're a little out of date there is causing some tests to fail again.

Sync from elastic master
@piyushdaftary
Contributor Author

@original-brownbear: Thanks for reviewing the code.
I have merged the latest master into my branch.

@original-brownbear
Member

@elasticmachine test this

Member

@original-brownbear original-brownbear left a comment

LGTM thanks @piyushdaftary! Looks very nice now, I'll merge this in the coming days.

@piyushdaftary
Contributor Author

piyushdaftary commented Jan 11, 2021

@original-brownbear: Thanks for approving the code changes. When will it be merged to master?

@DaveCTurner
Contributor

@original-brownbear looks like this one has slipped off the radar.

@pugnascotia
Contributor

@original-brownbear are you still looking at this?

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:12
@DaveCTurner
Contributor

Sorry this has taken so long @piyushdaftary, this PR is now rather stale. I've opened one afresh at #100316 and marked you as a co-author. Closing this.

@DaveCTurner DaveCTurner closed this Oct 5, 2023
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
Development

Successfully merging this pull request may close these issues.

Make snapshot deletion faster
6 participants